Classification of Protein Sequences into Paralog and Ortholog Clusters Using Sequence Similarity Profiles of KEGG/SSDB
Abstract
We are constructing KEGG/OC (Ortholog Clusters) from KEGG/SSDB (Sequence Similarity DataBase) [2]. KEGG/SSDB contains exhaustive protein sequence similarity scores for completed and nearly completed genomes, calculated by the SSEARCH program [3]. KEGG/OC is constructed automatically by a graph analysis that searches for cliques, using an appropriate definition for the profiles of similarity scores. However, the current version of KEGG/OC has two major problems. First, the current procedure leaves numerous singletons that should in fact be included in clusters of related proteins. Second, many clusters contain evolutionarily unrelated proteins. These problems are mainly due to the so-called multi-domain problem. Here, we tried to overcome them by using correlation coefficients between the profiles of KEGG/SSDB sequence similarity scores during the classification of paralog genes, and compared our results with the current versions of KEGG/OC and COG [4]. The clustering results were evaluated by the degree of consistency of motif structures within the clusters.
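To make the idea of correlating similarity-score profiles concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how a profile is defined or which correlation coefficient is used, so the sketch assumes a profile is a mapping from target gene ID to similarity score, that a missing target counts as score 0, and that the Pearson coefficient is meant. The gene IDs and scores are hypothetical.

```python
from math import sqrt


def profile_correlation(profile_a, profile_b):
    """Pearson correlation between two similarity-score profiles.

    Each profile is a dict {target_gene_id: score}. Targets absent from a
    profile are treated as score 0.0 (an assumption made for this sketch).
    Returns 0.0 when either profile has no variance.
    """
    targets = sorted(set(profile_a) | set(profile_b))
    n = len(targets)
    if n == 0:
        return 0.0

    xs = [float(profile_a.get(t, 0.0)) for t in targets]
    ys = [float(profile_b.get(t, 0.0)) for t in targets]

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical profiles: scores of three query genes against targets g1..g3.
gene_a = {"g1": 100, "g2": 200, "g3": 300}
gene_b = {"g1": 50, "g2": 100, "g3": 150}   # same shape as gene_a, lower scores
gene_c = {"g1": 150, "g2": 100, "g3": 50}   # opposite trend

r_ab = profile_correlation(gene_a, gene_b)
r_ac = profile_correlation(gene_a, gene_c)
```

In this sketch, two genes whose scores rise and fall together across the same targets get a coefficient near 1 even when their absolute scores differ. A single pairwise score cannot capture that, which is one plausible reason a profile-based measure could help with multi-domain proteins.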
Related resources
Automatic generation of KEGG OC (Ortholog Cluster) and its assignment to draft genomes
As the number of sequenced genomes is rapidly growing, a method for automatic generation of orthologous gene clusters is needed. However, it is computationally hard to cluster a large number of genes at once. To address this problem, we have developed a heuristic method to assign gene groups from closely related organisms to an ortholog cluster in a bottom-up approach. In this method, we consi...
The KEGG databases at GenomeNet
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is the primary database resource of the Japanese GenomeNet service (http://www.genome.ad.jp/) for understanding higher order functional meanings and utilities of the cell or the organism from its genome information. KEGG consists of the PATHWAY database for the computerized knowledge on molecular interaction networks such as pathways and comple...
SSDB: Sequence Similarity Database in KEGG
Availability of a large number of complete genomes enables us to compare several genomes and to search common and different features between genomes in terms of protein sequence similarities, which we call comparative genomics. It produces information about proteins useful for the assignment of the function to genes and for the research on the evolution of the genome. The large number of genes ...
Identification of Ortholog Groups in KEGG/SSDB by Considering Domain Structures
A huge amount of genome information has been stored in databases with the advent of recent genome projects. Although we can effectively predict protein sequences from these genomes, the functions of most proteins have not been experimentally determined. Therefore, computational methods are most important for function prediction, based on comparison and clustering of protein sequences. However, complications a...
Clustering of database sequences for fast homology search using upper bounds on alignment score.
Homology data are among the most important information used to predict the functions of unknown proteins and thus fast and accurate methods are needed. In this paper, we propose a new approach for fast and accurate homology search using pre-computed all-against-all similarity scores in a target database. We previously developed a method for derivation of an upper bound of the Smith-Waterman sco...